OCR exploration server

Warning: this site is under development.
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

Building a multi-modal Arabic corpus (MMAC)

Internal identifier: 000794 (Main/Exploration); previous: 000793; next: 000795

Authors: Ashraf Abdelraouf [Egypt]; Colin A. Higgins [United Kingdom]; Tony Pridmore [United Kingdom]; Mahmoud Khalil [Egypt]

Source: International journal on document analysis and recognition (Print), ISSN 1433-2833, 2010

RBID: Pascal:11-0227820

French descriptors

English descriptors

Abstract

Traditionally, a corpus is a large structured set of text, electronically stored and processed. Corpora have become very important in the study of languages. They have opened new areas of linguistic research, which were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction and provides a comprehensive study and analysis of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs) as well as naked pieces of Arabic words (NPAWs) and naked words (NWords); PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to generate a natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset has been carried out and is presented. MMAC was also tested using commercial OCR software and is publicly and freely available.
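The notion of a piece of Arabic word (PAW) mentioned in the abstract can be illustrated with a short sketch. It is not the authors' tool: it only assumes the general rule of Arabic script that some letters never join to the letter that follows them, so a connected segment ends after each of them. The letter sets and the function name `split_paws` are assumptions made for this illustration, and the sets cover only the basic Arabic letters.

```python
import unicodedata

# Letters that join to the previous letter only (assumed set for this sketch):
# alef and its hamza/madda variants, waw with hamza, teh marbuta,
# dal, thal, reh, zain, waw.
RIGHT_JOINING = set(
    "\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648"
)
# The isolated hamza joins to neither neighbour.
NON_JOINING = {"\u0621"}


def split_paws(word: str) -> list[str]:
    """Split one Arabic word into its connected segments (PAWs)."""
    paws: list[str] = []
    current = ""
    for ch in word:
        if unicodedata.combining(ch):
            # A combining mark stays with the letter it follows,
            # even when that letter has just closed a segment.
            if not current and paws:
                paws[-1] += ch
            else:
                current += ch
            continue
        if ch in NON_JOINING and current:
            # Nothing can join onto a hamza: close the open segment first.
            paws.append(current)
            current = ""
        current += ch
        if ch in RIGHT_JOINING or ch in NON_JOINING:
            # This letter does not join forward: the segment ends here.
            paws.append(current)
            current = ""
    if current:
        paws.append(current)
    return paws


if __name__ == "__main__":
    # "kitab" (book): kaf, teh, alef, beh
    for paw in split_paws("\u0643\u062A\u0627\u0628"):
        print(paw)
```

Under these assumptions a word is cut after every letter that cannot join forward, which is why a single word may yield several PAWs.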



The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Building a multi-modal Arabic corpus (MMAC)</title>
<author>
<name sortKey="Abdelraouf, Ashraf" sort="Abdelraouf, Ashraf" uniqKey="Abdelraouf A" first="Ashraf" last="Abdelraouf">Ashraf Abdelraouf</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Faculty of Computer Science, Misr International University</s1>
<s2>Cairo</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Cairo</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Higgins, Colin A" sort="Higgins, Colin A" uniqKey="Higgins C" first="Colin A." last="Higgins">Colin A. Higgins</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>School of Computer Science, The University of Nottingham</s1>
<s2>Nottingham</s2>
<s3>GBR</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Nottingham</settlement>
<region type="nation">Angleterre</region>
<region type="région" nuts="1">Nottinghamshire</region>
</placeName>
<orgName type="university">Université de Nottingham</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pridmore, Tony" sort="Pridmore, Tony" uniqKey="Pridmore T" first="Tony" last="Pridmore">Tony Pridmore</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>School of Computer Science, The University of Nottingham</s1>
<s2>Nottingham</s2>
<s3>GBR</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Nottingham</settlement>
<region type="nation">Angleterre</region>
<region type="région" nuts="1">Nottinghamshire</region>
</placeName>
<orgName type="university">Université de Nottingham</orgName>
</affiliation>
</author>
<author>
<name sortKey="Khalil, Mahmoud" sort="Khalil, Mahmoud" uniqKey="Khalil M" first="Mahmoud" last="Khalil">Mahmoud Khalil</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Computer and Systems Engineering Department, Faculty of Engineering, Ain Shams University</s1>
<s2>Cairo</s2>
<s3>EGY</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Cairo</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">11-0227820</idno>
<date when="2010">2010</date>
<idno type="stanalyst">PASCAL 11-0227820 INIST</idno>
<idno type="RBID">Pascal:11-0227820</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000141</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000632</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000169</idno>
<idno type="wicri:doubleKey">1433-2833:2010:Abdelraouf A:building:a:multi</idno>
<idno type="wicri:Area/Main/Merge">000800</idno>
<idno type="wicri:Area/Main/Curation">000794</idno>
<idno type="wicri:Area/Main/Exploration">000794</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">Building a multi-modal Arabic corpus (MMAC)</title>
<author>
<name sortKey="Abdelraouf, Ashraf" sort="Abdelraouf, Ashraf" uniqKey="Abdelraouf A" first="Ashraf" last="Abdelraouf">Ashraf Abdelraouf</name>
<affiliation wicri:level="1">
<inist:fA14 i1="02">
<s1>Faculty of Computer Science, Misr International University</s1>
<s2>Cairo</s2>
<s3>EGY</s3>
<sZ>1 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Cairo</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Higgins, Colin A" sort="Higgins, Colin A" uniqKey="Higgins C" first="Colin A." last="Higgins">Colin A. Higgins</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>School of Computer Science, The University of Nottingham</s1>
<s2>Nottingham</s2>
<s3>GBR</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Nottingham</settlement>
<region type="nation">Angleterre</region>
<region type="région" nuts="1">Nottinghamshire</region>
</placeName>
<orgName type="university">Université de Nottingham</orgName>
</affiliation>
</author>
<author>
<name sortKey="Pridmore, Tony" sort="Pridmore, Tony" uniqKey="Pridmore T" first="Tony" last="Pridmore">Tony Pridmore</name>
<affiliation wicri:level="4">
<inist:fA14 i1="01">
<s1>School of Computer Science, The University of Nottingham</s1>
<s2>Nottingham</s2>
<s3>GBR</s3>
<sZ>2 aut.</sZ>
<sZ>3 aut.</sZ>
</inist:fA14>
<country>Royaume-Uni</country>
<placeName>
<settlement type="city">Nottingham</settlement>
<region type="nation">Angleterre</region>
<region type="région" nuts="1">Nottinghamshire</region>
</placeName>
<orgName type="university">Université de Nottingham</orgName>
</affiliation>
</author>
<author>
<name sortKey="Khalil, Mahmoud" sort="Khalil, Mahmoud" uniqKey="Khalil M" first="Mahmoud" last="Khalil">Mahmoud Khalil</name>
<affiliation wicri:level="1">
<inist:fA14 i1="03">
<s1>Computer and Systems Engineering Department, Faculty of Engineering, Ain Shams University</s1>
<s2>Cairo</s2>
<s3>EGY</s3>
<sZ>4 aut.</sZ>
</inist:fA14>
<country>Égypte</country>
<wicri:noRegion>Cairo</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2010">2010</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Annotation</term>
<term>Arabic</term>
<term>Character recognition</term>
<term>Ground truth</term>
<term>Linguistics</term>
<term>Optical character recognition</term>
<term>Pattern recognition</term>
<term>Statistical analysis</term>
<term>Text</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Texte</term>
<term>Linguistique</term>
<term>Reconnaissance optique caractère</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance forme</term>
<term>Arabe</term>
<term>Réalité terrain</term>
<term>Annotation</term>
<term>Analyse statistique</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Linguistique</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Traditionally, a corpus is a large structured set of text, electronically stored and processed. Corpora have become very important in the study of languages. They have opened new areas of linguistic research, which were unknown until recently. Corpora are also key to the development of optical character recognition (OCR) applications. Access to a corpus of both language and images is essential during OCR development, particularly while training and testing a recognition application. Excellent corpora have been developed for Latin-based languages, but few relate to the Arabic language. This limits the penetration of both corpus linguistics and OCR in Arabic-speaking countries. This paper describes the construction and provides a comprehensive study and analysis of a multi-modal Arabic corpus (MMAC) that is suitable for use in both OCR development and linguistics. MMAC currently contains six million Arabic words and, unlike previous corpora, also includes connected segments or pieces of Arabic words (PAWs) as well as naked pieces of Arabic words (NPAWs) and naked words (NWords); PAWs and Words without diacritical marks. Multi-modal data is generated from both text, gathered from a wide variety of sources, and images of existing documents. Text-based data is complemented by a set of artificially generated images showing each of the Words, NWords, PAWs and NPAWs involved. Applications are provided to generate a natural-looking degradation to the generated images. A ground truth annotation is offered for each such image, while natural images showing small paragraphs and full pages are augmented with representations of the text they depict. A statistical analysis and verification of the dataset has been carried out and is presented. MMAC was also tested using commercial OCR software and is publicly and freely available.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Royaume-Uni</li>
<li>Égypte</li>
</country>
<region>
<li>Angleterre</li>
<li>Nottinghamshire</li>
</region>
<settlement>
<li>Nottingham</li>
</settlement>
<orgName>
<li>Université de Nottingham</li>
</orgName>
</list>
<tree>
<country name="Égypte">
<noRegion>
<name sortKey="Abdelraouf, Ashraf" sort="Abdelraouf, Ashraf" uniqKey="Abdelraouf A" first="Ashraf" last="Abdelraouf">Ashraf Abdelraouf</name>
</noRegion>
<name sortKey="Khalil, Mahmoud" sort="Khalil, Mahmoud" uniqKey="Khalil M" first="Mahmoud" last="Khalil">Mahmoud Khalil</name>
</country>
<country name="Royaume-Uni">
<region name="Angleterre">
<name sortKey="Higgins, Colin A" sort="Higgins, Colin A" uniqKey="Higgins C" first="Colin A." last="Higgins">Colin A. Higgins</name>
</region>
<name sortKey="Pridmore, Tony" sort="Pridmore, Tony" uniqKey="Pridmore T" first="Tony" last="Pridmore">Tony Pridmore</name>
</country>
</tree>
</affiliations>
</record>

To work with this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000794 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000794 | SxmlIndent | more
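The record printed by these commands can also be read outside Dilib. The following is a minimal sketch in Python, assuming the record text has been captured as a string. As displayed on this page, the record uses the `wicri:` and `inist:` prefixes without declaring them, which a namespace-aware parser rejects, so the sketch adds placeholder declarations to the root element first. The placeholder URIs, the function names and the shortened `SAMPLE` record are assumptions made for illustration; they are not part of Dilib or Wicri.

```python
import xml.etree.ElementTree as ET

# Placeholder namespace URIs: assumptions for this sketch only,
# not official Wicri or INIST identifiers.
PLACEHOLDER_NS = {
    "wicri": "urn:example:wicri",
    "inist": "urn:example:inist",
}


def parse_record(xml_text: str) -> ET.Element:
    """Parse a <record> whose wicri:/inist: prefixes are undeclared."""
    decls = "".join(
        f' xmlns:{prefix}="{uri}"' for prefix, uri in PLACEHOLDER_NS.items()
    )
    # Declare the prefixes on the root element so the parser accepts them.
    patched = xml_text.replace("<record>", f"<record{decls}>", 1)
    return ET.fromstring(patched)


def authors_with_country(record: ET.Element) -> list[tuple]:
    """Return (author name, country) pairs from the titleStmt block."""
    pairs = []
    for author in record.iterfind(".//titleStmt/author"):
        name = author.findtext("name")
        country = author.findtext("affiliation/country")
        pairs.append((name, country))
    return pairs


# Shortened copy of the record shown on this page, for demonstration.
SAMPLE = """<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">Building a multi-modal Arabic corpus (MMAC)</title>
<author>
<name sortKey="Abdelraouf, Ashraf">Ashraf Abdelraouf</name>
<affiliation wicri:level="1">
<country>Égypte</country>
<wicri:noRegion>Cairo</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
</fileDesc>
</teiHeader>
</TEI>
</record>"""

if __name__ == "__main__":
    for name, country in authors_with_country(parse_record(SAMPLE)):
        print(name, country)
```

Restricting the search to `titleStmt` avoids counting each author twice, since the record repeats the author list under `sourceDesc`.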

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:11-0227820
   |texte=   Building a multi-modal Arabic corpus (MMAC)
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024